In the last lesson, we learned how to pull data from an API and
reduce redundancies in our workflows through functions and iterations.
In this lesson we will use the functions in the previous lesson to learn
how to manipulate data frames with the tidyverse, plot
elegant time series graphs with ggplot(), and the
scales and plotly packages.
There are five exercises in this lesson that must be completed.
# includes the tidyverse, rmarkdown, httr, jsonlite, scales, and plotly
source("setup.R")
Using the parkwide_visitation() function from the last
lesson and mapping, let’s pull park-wide visitor data from 1980-2021.,
and name the final object parkwide. (Code hack: we can use
1980:2021 to create a vector of years so we don’t have to
write each year out!)
parkwide_visitation <- function(year){
raw_data <- httr::GET(url =
paste0("https://irmaservices.nps.gov/v3/rest/stats/total/", year))
extracted_data <- httr::content(raw_data, as = "text", encoding = "UTF-8")
final_data <- jsonlite::fromJSON(extracted_data)
return(final_data)
}
years <- (1980:2021)
parkwide <- years %>% map(~ parkwide_visitation(year = .)) %>% bind_rows()
Using the unit_visitation() function from the
last lesson and mapping, pull visitor data from 1980-2021 for the
following park units: ROMO, ACAD, LAKE, YELL, GRCA, ZION, OLYM, and
GRSM. Name the final output units.
unit_visitation <- function(park, start_month = "01", start_year, end_month = "12", end_year){
raw_data <- httr::GET(url = paste0("https://irmaservices.nps.gov/v3/rest/stats/visitation?unitCodes=", park,
"&startMonth=", start_month,
"&startYear=", start_year,
"&endMonth=", end_month,
"&endYear=", end_year))
# convert content to text
extracted_data <- httr::content(raw_data, as = "text", encoding = "UTF-8")
# parse text from JSON to data frame
final_data <- jsonlite::fromJSON(extracted_data)
return(final_data)
}
parks <- c("ROMO", "ACAD", "LAKE", "YELL", "GRCA", "ZION", "OLYM", "GRSM")
units <- parks %>% map(~ unit_visitation(park = ., start_year = '1980', end_year = '2021')) %>%
bind_rows()
Look at the data frame structure of parkwide and
units; they’re exactly the same! So let’s go ahead and bind
those together:
visitation <- bind_rows(parkwide, units)
… except, the rows in parkwide’s UnitCode and UnitCode
columns are empty. 😑 Let’s fix the UnitCode column to list
“Parkwide” using mutate() and an ifelse()
statement:
visitation <- visitation %>% mutate(UnitCode = ifelse(is.na(UnitCode), "Parkwide", UnitCode))
Think of the above ifelse() operation as: “If the column
UnitCode is NA, replace NA with
Parkwide. Otherwise, preserve what is already in the
UnitCode column.”
Now that we have a single data set containing all of the NPS
visitation data that we’ve pulled, let’s start exploring it! But first,
let’s aggregate the monthly data into annual data using
group_by() and summarize():
yearly <- visitation %>%
group_by(UnitCode, Year) %>%
summarize(TotalVisitation = sum(RecreationVisitors))
yearly
What does visitation data look like through time? First we can try to graph all of the park units together:
ggplot(data=yearly)+
geom_point(aes(x = Year, y = TotalVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = TotalVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
theme_bw(base_size=10)
… yikes, not surprisingly, parkwide visitation is wayyyy higher than our individual unit’s visitation data, making our graph pretty useless. It might be nice to have each park unit in a graph of its own.
We can create individual graphs for each unit using
facet_wrap(), and set the y-axes for each plot to
"free_y":
ggplot(data=yearly) +
geom_point(aes(x = Year, y = TotalVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = TotalVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
facet_wrap(~UnitCode, scales = "free_y") +
theme_bw(base_size=10)
We can also make this plot interactive by feeding it into
plotly’s ggplotly() function:
plotly::ggplotly(
ggplot(data=yearly) +
geom_point(aes(x = Year, y = TotalVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = TotalVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
facet_wrap(~UnitCode, scales = "free_y") +
theme_bw(base_size=10)
)
Create an interactive graph with two separate panes: one showing park-wide visitation, the other showing all the individual park units together. Both panes should have different y-axes.
yearly <- yearly %>%
mutate(group = ifelse(UnitCode == "Parkwide", "Parkwide", "Unit"))
plotly::ggplotly(
ggplot(data=yearly)+
geom_point(aes(x = Year, y = TotalVisitation, color = UnitCode)) +
geom_path(aes(x = Year, y = TotalVisitation, color = UnitCode)) +
scale_y_continuous(labels = scales::label_scientific()) +
facet_wrap(~group, scales = "free_y") +
theme_bw(base_size=10)
)
It is pretty clear that some park units get orders of magnitude more visitors than others. But just how much of the total park visitation do each of these parks account for from year to year? Here we walk through two methods to tackle this question, pivoting and joining, to get park unit visitation side-by-side with park-wide data.
Currently, our yearly data is considered narrow because we
have all of our NPS visitation data in one column, with multiple rows
representing the same year. We can make this data wide by using
the function pivot_wider()
wide_data <- yearly %>%
select(Year, UnitCode, TotalVisitation) %>%
pivot_wider(., names_from = UnitCode, values_from = TotalVisitation)
… where names_from represents the column with the values
you are hoping to spread into new columns, and values_from
represents the data you want to fill these new columns with.
We can make the data set narrow again by using the function
pivot_longer():
narrow_data <- wide_data %>%
pivot_longer(cols = -Year,
names_to = "Park",
values_to = "TotalVisitation")
… where cols are the columns we want to gather into one
column (or, the column(s) you DON’T want to gather), while
names_to and values_to are the names of the
new columns produced from the pivot.
Using wide_data as the starting point, create an
interactive time series plot showing the annual percentage of the total
visitation made up by all park units.
fracts <- wide_data %>%
# GO OVER THIS IN VIDEO...
# BASICALLY THEY WILL GET THE QUICK APPROACH TO THE ANSWER
# ... IF THEY PAID ATTENTION TO THE VIDEO.
mutate_at(.vars = parks, .funs = ~ (./Parkwide)*100) %>%
pivot_longer(cols = -c(Parkwide,Year), names_to = "Park", values_to = "Percentage")
plotly::ggplotly(
ggplot(fracts) +
geom_line(aes(x=Year, y=Percentage, color = Park)) +
theme_bw(base_size=10)
)
Another way of getting park-wide visitation side-by-side with the
park unit data is through the use of joining our original
units and parkwide data sets:
joined_data <- inner_join(x = units, y = parkwide, by = c("Year","Month"))
… where x and y are the two data sets you
want joined, and by indicates the column(s) to match them
by. Note: there are several ways of joining data. Explore them with
?`mutate-joins` and ?`filter-joins`.
Using joined_data as the starting point, create
an interactive time series plot showing the annual percentage of the
total visitation made up by all park units. This plot should look nearly
identical to the previous plot.
fracts <- joined_data %>%
group_by(Year, UnitCode.x) %>%
summarize(ParkVisitors = sum(RecreationVisitors.x),
TotalVisitors = sum(RecreationVisitors.y)) %>%
mutate(Percentage = (ParkVisitors/TotalVisitors) * 100)
plotly::ggplotly(
ggplot(fracts) +
geom_line(aes(x=Year, y=Percentage, color = UnitCode.x)) +
theme_bw(base_size=10)
)
Which park on average has the most visitation? Which park has the least visitation? Base your response on the data starting in 1990, ending in 2021. Defend your answer with numbers!
one_way <- fracts %>%
filter(Year>=1990) %>%
group_by(UnitCode.x) %>%
summarize(mean = mean(Percentage))
other_way <- yearly %>%
filter(Year >= 1990) %>%
group_by(UnitCode) %>%
summarize(mean = mean(TotalVisitation))